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Abstract 

We develop a novel method, based on the statistical concept of the Vapnik-Chervonenkis dimension, to evaluate 
the selectivity (output cardinality) of SQL queries - a crucial step in optimizing the execution of large scale database 
and data-mining operations. The major theoretical contribution of this work, which is of independent interest, is 
an explicit bound to the VC-dimension of a range space defined by all possible outcomes of a collection (class) of 
queries. We prove that the VC-dimension is a function of the maximum number of Boolean operations in the selection 
predicate and of the maximum number of select and join operations in any individual query in the collection, but it 
is neither a function of the number of queries in the collection nor of the size (number of tuples) of the database. 
We leverage on this result and develop a method that, given a class of queries, builds a concise random sample of a 
database, such that with high probability the execution of any query in the class on the sample provides an accurate 
estimate for the selectivity of the query on the original large database. The error probability holds simultaneously for 
the selectivity estimates of all queries in the collection, thus the same sample can be used to evaluate the selectivity 
of multiple queries, and the sample needs to be refreshed only following major changes in the database. The sample 
representation computed by our method is typically sufficiently small to be stored in main memory. We present 
extensive experimental results, validating our theoretical analysis and demonstrating the advantage of our technique 
when compared to complex selectivity estimation techniques used in PostgreSQL and the Microsoft SQL Server 



1 Introduction 

As advances in technology allow for the collection and storage of vast databases, there is a growing need for ad- 
vanced machine learning techniques for speeding up the execution of queries on such large datasets. In this work 
we focus on the fundamental task of estimating the selectivity, or output size, of a database query, which is a crucial 
step in a number of query processing tasks such as execution plan optimization and resource allocation in paral- 
lel and distributed databases. The task of efficiently obtaining such accurate estimates has been extensively stud- 
ied in previous work with solutions ranging from storage of pre-computed statistics on the distributi on of va l ues in 
the tables, to online sampling of the databases, and to combinations of the two approa ches liGanguly et al.l 1996 : 



[(Ganguly 

Gantietal., 2000; Gibbons and Matias, 1998; Haas and Swami, 1992,J995; Hou et all [19911 1 19881; Larson et al 



20071; iLipton and Naughtonl 1 19951; iLipton et all Il990l; iPoosala and loannidisi Il997ll . Histograms, simple yet pow- 



erful statistics of the data in the tables, are the most commonly used solution in practice, thanks to their computational 
and space efficiency. However, there is an inherent limitation to the accuracy of this approach when estimating the 
selectivity of queries that involve either multiple tables/columns or correlated data. Running the query on freshly 
sampled data gives more accurate estimates at the cost of delaying the execution of the query while collecting random 
samples from a disk or other large storage medium and then performing the analysis itself. This approach is therefore 
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usually more expensive than a histogram lookup. Our goal in this work is to exploit both the computational efficiency 
of using pre-collected data and the provable accuracy of estimates obtained by running a query on a properly sized 
random sample of the database. 



We apply the statistical concept of VC-dimension llVapnik and Chervonenkislll971ll to develop and analyze a novel 



technique to generate accurate estimates of query selectivity. Roughly speaking, the VC-dimension of a collection of 
indicator functions (hypotheses) is a measure of its complexity or expressiveness (see Sect. 13.11 for formal defini- 
tions). A major theoretical contribution of this work, which is of independent interest, is an explicit bound to the 
VC-dimension of any class of queries, viewed as indicator functions on the Cartesian product of the database tables. 
In particular, we show that the VC-dimension of a class of queries is a function of the maximum number of Boolean, 
select and join operations in any query in the class, but it is not a function of the number of different queries in the 
class. By adapting a fundamental result from the VC-dimension theory to the database setting, we develop a method 
that for any class of queries, defined by its VC-dimension, builds a concise sample of the database, such that with 
high probability, the execution of any query in the class on the sample provides an accurate estimate for the selectivity 
of the query on the original large database. The error probability holds simultaneously for the selectivity estimate of 
all queries in the collection, thus the same sample can be used to evaluate the selectivity of multiple queries, and the 
sample needs to be refreshed only following major changes in the database. The size of the sample does not depend 
on the size (number of tuples) in the database, just on the complexity of the class of queries we plan to run, measured 
by its VC-dimension. Both the analysis and the experimental results show that accurate selectivity estimates can be 
obtained using a sample of a surprising small size (see Table [1] for concrete values), which can then reside in main 
memory, with the net result of a significant speedup in the execution of queries on the sample. 

A technical difficulty in applying the VC-dimension results to the database setting is that they assumes the avail- 
ability of a uniform sample of the Cartesian product of all the tables, while in practice it is more efficient to store a 
sample of each table separately and run the queries on the Cartesian product of the samples, which has a different 
distribution than a sample of the Cartesian product of the tables. We develop an efficient procedure for constructing a 
sample that circumvents this problem (see Sect.|5]l. 

We present extensive experimental results that validate our theoretical analysis and demonstrate the advantage of 
our technique when compared to complex selectivity estimation techniques used in PostgreSQL and the Microsoft 
SQL Server The main advantage of our method is that it gives provably accurate predictions for the selectivities 
of all queries with up to a given complexity (VC-dimension) specified by the user before creating the sample, while 
techniques like multidimensional histograms or join synopses are accurate only for the queries for which they are built. 

Note that we are only concerned with estimating the selectivity of a query, not with approximating the query 
answer using a sample of the database (Das l!2009ll presents a survey of the possible solutions to this latter task). 

Outline. The rest of the paper is organized as follows. We review the relevant previous work in Sect.|2l In Sect. Owe 
formulate the problem and introduce the Vapnik-Chervonenkis dimension and the related tools we use in developing 
our results. Our main analytical contribution, a bound on the VC dimension of class of queries is presented in Sect.|4] 
The application of these results for selectivity estimation is given in Sect.|5] Experiments are presented in Sect.|6] 



2 Related Work 



Methods to estimate the selectivity (or cardinality of the output) of queries have been extensively studied in the 
database literature primarily due to the importance of this task to query plan optimization and resource alloca- 
tion. A variety of approaches have been explored, ranging from the use of sampling, both online and offline, to 
the pre-computation of different statis t ics su ch as histograr ns, to the application o f methods fror n machine learn- 



199711 . data minin g |Grvz and Liang, 2004], optimization IChaudhuri et al. 



ing [Chen et al., 1990; Harangsri et al 

12007; Markl et al.„ ,2007] , and probabihstic modehng iGetoor et al.. 2001; Re and Suciu. 2010 1 . 

T he use of sampling f or selectivity estimation has been studied mainly in the context of online samphng j Lip ton and NaughtonL 

1995l:lLipton et allll990ll . where a sample is obtained, one tuple at a time, after the arrival of a query and it used only 



to evaluate the selectivity of that query and t hen discarded. Samp l ing at random frorn a large database residing on disk 
is an expensive operation BBrown and Haasl 120061: iGemulla et al.l [ |2006l: l lOlkenll 199311 . an d in some cases sampling for 
an accurate cardinality estimate is not significantly faster than full execution of the query iHaas et all 1 19931 1199411 . 
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A variety of sampling and statistical analysis techniques has been tested to improve the efficiency of the sam- 
pling procedures and in parti cular to identify early stopping conditions. These include sequential sampling analy- 
sis [H aas and Swami, 1992; H ou et al.L lT991]. keeping additional statistics to i mprove the estim ation [Haas and Swam! 
I1995I1 . labelling the tuples and using label-dependent estimation proc edures llGangulv et all Il996ll . or applying the 



cumulativ e distributi on function inversion procedure [|Wu et aL , 200 1*1. Some work als o looked at nonuniform sam 



pling IBab cock et al ., 2003; Estan and Naughton, 200^ and stratified sampling 1 Chaudh uri et al II2OO7 ; Joshi and Jermaind 
I2OO8II . Despite all these relevant contributions, online sampling is still considered too expensive for most applications. 
An offline sampling approach was explored by Ngu et al. 1200 41. who used systematic sampling (requiring the tuples 
in a table to be sorted according to one of the attributes) with a sample size dependent on the number of tuples in the 
table. The paper does not give any explicit guarantee on the accuracy of their predictions. Chaudhuri et al. i2007ll 
present an approach which uses optimization techniques to identify suitable strata before sampling. The obtained 
sample is such that the mean square error in estimating the selectivity of queries belonging to a given workload is 
minimized, but there is no quality guarantee on the maximum error Haas il996tl developed Hoeffding inequalities to 
bound the probability that the selectivity of a query estimated from a sample deviates more than a given amount from 
its expectation. However, to estimate the selectivity for multiple queries and obtain a given level accuracy for all of 
them, simultaneous statistical inference techniques hke the unio n bound should be used, which are known to be overly 
conservative when the number of queries is large llMillen,ll98lll . On the contrary, our result will hold simultaneously 
for all queries within a given complexity (VC dimension). 

A technical problem arises when combining join operations and sampling. As pointed out by Chaudhuri et al. il999ll . 
the Cartesian product of uniform samples of a number of tables is different from a uniform sample of the Cartesian 
product of those tables. Furthermore, given a size s, it is impossible to a priori determine two sample sizes si and S2 
such that uniform samples of these sizes from the two tables will give, when joined together along a common column, 
a sample of the join table of size s. In Sect. |5]we explain why only the first issue is of concern for us and how we 
circumvent it. 

In practice most database systern s use pre-computed statistics to predict qu ery selectivity IGanti et al. . I2OOOI: 
Gibbons and Matiasi 119981: IHou et"all Il988t Ijiii et al.L 120061: iLarson et all 1200711 . with histograms being the most 



commonly us ed representation. The construction, maintenance, and use of histograms were thoroughly e xamined in 



the literature jloannidis and Poosala. 1 19951: Ijagadish et all Il998l: iMatias et all Il998l: IPoosala et all Il996ll , with both 



theoretical and experimental results. In particular Chaudhuri et al. 1 1998ll rigorouslv evaluated the size of the sample 
needed for building a histogram providing good estimates for the selectivities of a large group of (select only, in their 
case) queries. Kaushik et al. I2005il extensively compared histograms and sampling from a space complexity point 
of view, although their sample-based estimator did not offer a uniform probabilistic guarantee over a set of queries 
and they only consider the case of foreign-key equijoins. We address both these points in our work. Although very 
efficient in terms of storage needs and query time, the quality of estimates through histograms is inherently limited 
for complex queries because of two major drawbacks in the use of histograms: intra-bucket uniformity assumption 
(i.e., assuming a uniform distribution for the frequencies of values in the same bucket) and inter-column independence 
assumption (i.e., assuming no correlation between the values in different columns of the same or of different tables). 
Differ ent authors suggested solutions to improve the estimation of selectivity withou t making the ab ove assump 
tions BBruno and Chaudhuril 120041 : iDobral 120051: iPoosala and loannidisl 119971: IWang an d Sevci k. l2003t I Wang et al 



1997 1 . Among these s olutions, the use of mul tidimensional histograms 1 Br uno et all uOOlt IPoosala and loannidii 



1997; Srivastava et al. . I2OO6I: IWang and Sevcikl 120031 seems the most practical. Nevertheless, these techniques are 



not widespread due to the extra memory and computational costs in their implementation. 

Efficient and practical techniques for drawing random samples from a database and f or updating the sample 
when the underlying tables evolve have been extensively analyzed in the database literature IBrown and Haas , I2OO6I: 
GemuUa et alll2006ll2007tlHaas and Konigll2004lJermaine et alll2004 1. 



The Vapnik-Chervonenkis dimension was first introduced in a seminal article I Vapnik and Chervonenki4 1971 1 
on the conve rgence of probability distributions, but it was only with the work of Haussler and Welzl Ill986r and 
Blumer et al. 1 1989ll that it was applied to computational sampling and learning theory. Si nce then, VC-dimension has 
encountered enormous succe ss and application in the fie lds of computational geometry IChazelle , I2OOOI: iMatouseld 
200211 and machine learning lAnthonv and Bartlett , I1999II but its use in system-related fields is not as widespread. In 



the database literature, it was used in the context of constraint databases to compute good approximations of aggre- 
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gate operators IBenedikt and Libkin , l2002ll . VC- dimension-related results were also recently applied in the field of 
database privacy by Blum, Ligett, and Roth to show a bound on the number of queries needed for an attacker 

to learn a private concept in a database. Gross- Amblard [2011] showed that content with unbounded VC-dimension 
can not be watermarked for privacy purposes. 

To the best of our knowledge, our work is the first to provide explicit bounds on the VC-dimension of queries and 
to apply the results to query selectivity estimation. 



3 Preliminaries 

We consider a database V of k tables 7i, • • • ,71-. We denote a column C of a table T as T.C and, for a tuple 
t G T, the value of t in the column C as t.C. We denote the domain of the values that can appear in a column T-C 
as D{T.C). Our focus is on queries that combine select and join operations, defined as follows. We do not take 
projection operations into consideration because their selectivities have no impact on query optimization. 

Definition 1. Given a table T with columns T.Ci , • ■ • , T.Cf, a selection query q on T is an operation which returns 
a subset S of the tuples of T such that a tuple tofT belongs to S if and only if the values in t satisfy a condition C (the 
selection predicate) expressed by q. In full generality, C is the Boolean combination of clauses of the form T-Ci op a^, 
where T-Ci is a column of T, "op" is one of {<, >, >, <, =, ^} and a,; is an element of the domain of T-Ci. 

We assume that all D{T-Ci) are such that it is possible to build total order relations on them. This assumptions 
does not exclude categorical domains from our discussion, because the only meaningful values for "op" for such 
domains are "=" and "7^", so we can just assume an arbitrary but fixed order for the categories in the domain. 

Definition 2. Given two tables 71 and T2, a join query q on a common column C (i.e. a column present both in 7i 
and T2) is an operation which returns a subset of the Cartesian product of the tuples in Ti and 72- The returned subset 
is defined as the set 

{ih,t2) : h e Ti,h e Ta, s.t. h.Coph.C} 

where "op" is one of {<, >, >, <, =, ^}. 



Our definition of a join query is basically equivalent to that of a theta-join IGarcia-Molina et aU 120021 Sect. 5. 2. 7], 
with the limitation that the join condition C can only contain a single clause, i.e. a single condition on the relationship 
of the values in the shared column C and only involve the operators {<,>,>,<,=, 7^} (with their meaning on D{C)). 
The pairs of tuples composing the output of the join in our definition have a one-to-one correspondence with the tuples 
in the output of the corresponding theta-join. 

Definition 3. Given a set of £ tables 7i, • • • ,Te, a combination of select and join operations is a query returning a 
subset of the Cartesian product of the tuples in the sets 5*1, •• • , S'^, where Si is the output of a selection query on %■ 
The returned set is defined by the selection queries and by a set of join queries on 5*1, . . . , 5^. 

Definition 4. Given a query q, a query plan for g is a directed binary tree Tq whose nodes are the elementary (i.e. 
select or join) operations into which q can be decomposed. There is an edge from a node a to a node h if the output of 
a is used as an input to b. The operations on the leaves of the tree use one or two tables of the database as input. The 
output of the operation in the root node of the tree is the output of the query. 

It follows from the definition of a combination of select and join operations that a query may conform with multiple 
query plans. Nevertheless, for all the queries we defi ned there is (at least) one qu ery plan such that all select operations 



are in the leaves and internal nodes are join nodes BGarcia-Molina et aU 1200211 . To derive our results, we use these 
specific query plans. 

Two crucial definitions that we use throughout the work are the cardinality of the output of a query and the 
equivalent concept of selectivity of a query. 

Definition 5. Given a query q and a database V, the cardinality of its output is the number of elements (tuples if q is a 
selection queries, pairs of tuples if q is a join query, and ^-uples of tuples for combinations of join and select involving 
t tables) in its output, when run on V. The selectivity a{q) of q is the ratio between its cardinality and the product of 
the sizes of its input tables. 
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Our goal is to store a succint representation (sample) S of the database V such that an execution of a query on the 
sample S will provide an accurate estimate for the selectivity of each operation in the query plan when executed on 
the database V. 



3.1 VC-Dimension 

The Vapnik-Chernovenkis (VC) Dimension of a family of indicator functions (or equivalently a fa mily of subsets) on a 



space of points is a measure of the complexity or expressiveness of set of functions in that structure | Vapnik and Chervonenkisi 



I97I . A finite bound on the VC-dimension of a structure implies a bound on the number of random samples required 



for approximately learning that structure. We outline here some basic definitions and results and their adaptation to the 
specific setting of que ries. W e refer the reader to the works of Alon and Spencer |2008, Sect. 14.4], Chazelle 120001 



Chap. 4], and Vapnik 11 199911 for an in-depth discussion of the VC-dimension theory. 



VC-dimension is defined on range spaces: 

Definition 6. A range space is a pair {X, R) where X is a (finite or infinite) set and i? is a (finite or infinite) family 
of subsets of X. The members of X are called points and those of R are called ranges. 

In our setting, for a class of select queries Q on a table T, X is the set of all tuples in the input table, and R the 
family of the outputs (as sets of tuples) of the queries in Q when run on T. For a class Q of queries combining select 
and join operations, X is the Cartesian product of the associated tables and R is the family of outcomes of queries 
in Q, seen as £-uples of tuples, if £ tables are involved in the queries of Q. When the context is clear we identify the 
family R with a class of queries. 

To define the VC-dimension of a range space we consider the projection of the ranges into a set of points: 

Definition?. Let (X, i?) be a range space and A C X. The prq/'ecf/on of i? on A is defined as Pr( A) — {rOA : r £ 
R}. 

A set is said to be shattered if all its subsets are defined by the range space: 

Definition 8. Let {X, R) be a range space and A C X.lf \Pji{A) \ — 2^, then A is said to be shattered by R. 

The VC-dimension of a range space is the cardinality of the largest set shattered by the space: 

Definition 9. Let S = {X, R) be a range space. The Vapnik-Chervonenkis dimension (or VC-dimension) of S, denoted 
as VC{S) is the maximum cardinality of a shattered subset of X. If there are arbitrary large shattered subsets, then 
VC{S) = 00. 

When the ranges represent all the possible outputs of queries in a class Q applied to database tables T), the VC- 
dimension of the range space is the maximum number of tuples such that any subset of them is the output of a query 
in Q. 

Note that a range space (X, R) with an arbitrary large set of points X and an arbitrary large family of ranges R can 
have a bounded VC-dimension. A simple example is the family of intervals in [0, 1] (i.e. X is all the points in [0, 1] and 
R all the intervals [a, h], such that < a < < 1). Let A = {x, y, z} be the set of three points 0<a;<y<z<l. 
No interval in R can define the subset {x, 0} so th e VC-dimension of this range space is < 3. This observation is 
generalized in the following result [Matguse M. I2OO2I Lemma 10.3.1]: 



Lemma 1. The VC-Dimension of the range space {W^ , X), where X is the set of all half-spaces in equals d + 1. 

The main application of VC-dimension in statistics and learning theory is its relation to the minimum size sample 
needed for approximate learning of a set of indicator functions or hypothesis. 



Definition 10. Let (X, R) be a range space and let A be a finite subset of X. 

1. For < e < 1, a subset i? C A is an e-approximation for A in (X, R) if Vr G R, we have 



L4nr| ISnr 



< £. 
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2. For < p, £ < 1, a subset B C A is a relative {p, e)-approximation for A in {X, R) if for any range r £ R 

\Anr\ \Bnr\ 

~\A\ TBT 



such that ^-j^ > p we have 



— ^ \A\ f*^'" ™y range r e i? such that ' | ^ | ' < p we have 



l-Bnr 



< il + e)p. 



s>mm\\A\,— [d + log-)\ , (1) 



An £-appro ximation (resp. a relative (jP, £)-a pproxima tion) can be pr obabiHsti cally c onstructed by sampling the 
point space [Har-Peled and Sharir , 201 1 ; Li et all ,2001; Vapnik and Chervonenkist Il971 ] . 

Theorem 1. Let {X, R) be a range space of VC-dimension at most d, A a finite subset of X, and B G Aa random 
sample of A of cardinality s. 

1. There is a positive constant c such that for any < £, 5 < 1 and 

B is an e-approximation for A with probability at least 1 — 5. 
There is a positive constant d such that for any < p < 1 and 

2 1 ~ -o - + log - 

B is a relative (p, e)-approximation for A with probability at least 1 — 5. 

Loffler and Phillips |2009] showed experimentally that the constant c is approximately 0.5. It is also inte resting to 
note t hat an £-approximation of size O ( 4- log - ) can be built deterministically in time 0{d^'^ ( \ log - Y\X\) llChazelle , 
2OOOII . 



s > min \ \A\, — ( dlog - + log ■ 



In Sect. |5] we use an £-approximation (or a relative {p, £)-approximation) to compute good estimates of the selec- 
tivities of all queries in Q. We obtain a small approximation set through probabilistic construction. The challenge in 
applying Thm.[T]to our setting is computing the VC-dimension of a range space defined by a class of queries. We state 
here a few fundamental results that will be used in our analysis. 

First, it is clear that if the VC-dimension of a range space {X, R) is d then 2*^ < \R\ since all subsets of some set 
of size d are defined by members of R. Next we define for integers n > and d > the growth function g{d, n) as 

d 

aid, n) ^ 

1=0 

The growth function is used in the following results jAlon and Spencer , 2008 , Sect. 14.4]. 

Lemma 2 (Sauer's Lemma). If {X, R) is a range space of VC-dimension d with \X\ — n points, then \R\ < g{d, n). 

Corollary 1. If{X, R) is a range space of VC-dimension d, then for every finite A <Z X, |Pr(^)| < g{d, \A\). 

Our analysis in Sect.|4]uses the following bound which is an extension of I Alon and Spence3,l2008i Corol. 14.4.3] 
to arbitrary combinations of set operations. 

Lemma 3. Let [X, R) be a range space of VC-dimension d > 2 and let {X, Rh) be the range space on X in which Rh 
include all possible combinations of union and intersections ofh members of R. Then VC{X, < idh \og{dh). 

Proof. Let A be an arbitrary subset of cardinality n of X. We have |Pfl(yl.)| < g{d, n) < n'^. There are 



(\Pr{A)\\ ^ fg{d,n) 



dh 



possible choices of h members of Pii{A), and there are no more than 

Boolean combinations using unions and intersections of the h sets, where Ck-i is the {h — 1)*^ Catalan number 

2" > h^^n'^'' > \Pr,SA)\ 

then A cannot be shattered. This inequality holds for n > 3dh \og{dh) □ 
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4 The VC-dimension of Classes of Queries 



In this section we develop a general bound on the VC-dimension of classes of queries. We start by computing the VC- 
dimension of simple select queries on one column and then move to more complex types of queries (multi-attributes 
select queries, join queries). We then extend our bounds to general queries that are combinations of multiple select 
and join operations. 

4.1 Select Queries 

Let Tbe a table with rn columns T.Ci, . . . , T-Cm, and n tuples. For a fixed column T.C, consider the set Ec of the 
selection queries in the form 

SELECT * FROM T WHERE T-CiOpa (2) 

where op is an inequality operator (i.e., either ">" or "<"11] and a e D(T-C). 

Let qi,q2 € Sc be two queries. We say that qi is equivalent to q2 (and denote this fact as qi = q2) if their outputs 
are identical, i.e., they return the same set of tuples when they are run on the same database. Note that qi = q2 defines 
a proper equivalence relation. 

Let EJ, C Ep be the maximum subset of Ec that contains no equivalent queries, i.e. it contains one query from 
each equivalent class. 

Lemma 4. Let T be a table with m columns d, I < i < m, and consider the set of queries 

m 
1=1 

where E^. is defined as in the previous paragraph. Then, the range space S — (T, E^) has VC-dimension at most 
m + 1. 

Proof. We can view the tuples of T as points in the m-dimensional space A — D{T.Ci) x D{T.C2) x • ■ • x 
D{T.Cm). A tuple t such that t.Ci = ai,t.C2 = a2, ■ ■ ■ , t.Cm — is represented on the space by the point 
(fli, 02, • • • , a„i). 

The queries in E^ can be seen as half spaces of A. In particular any query in E^ is defined as in (|2|l and can 
be seen as the half space {(xi, •• • ,Xi,--- ,Xm) ■ Xj E D{Tj)foij ^ i, andxiOpa} C A. It then follows from 
LemmalHthat VC{S) <m + l. □ 

We now extend these result to general selection queries. Consider the set E of queries whose selection predicate 
can be expressed as the Boolean combination of the selection predicates of at most two queries from E^. These are 
the queries of the form: 

SELECT * FROM T WHERE T.^i op^ ai bool 7'.X2 02 

where T.Xi and T-X2 are two columns from T (potentially, T-Xi = T.X2), oi G D{T.Xi), 02 G D{T.X2), "op," 
is either ">" or "<" and "bool" is either "AND" or "OR". Note that, in particular the queries in the form 

SELECT * FROM T WHERE T.^iCqopa 

where eqop is either or "7^", belong to E^ because we can rewrite a selection predicate containing one of these 
operators as a selection predicate of two clauses using ">" and "<" joined by either AND (in the case of "=") or OR 
(in the case of "^"). 

By applying Lemma[3] we have that the VC-dimension of the range space (T, E^) is at most 2(m + 1)2 log((m + 
1)2), where m is the number of columns in the table T. 

We can generalize this result to b Boolean combinations of selection predicates as follows. 

'The operators ">" and "<" can be reduced to ">" and "<" respectively. 
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Lemma 5. Let T be a table with m columns, letb>0 and let be the set of selection queries on T whose selection 
predicate is a Boolean combination ofb clauses. Then, the VC-dimension of the range space Sb — (7~, ) is at most 
3((to + 1)6) log((m + l)&). 

Note that we can not apply the bound used in the proof of Lemma |4] since not all queries in are equivalent 
to axis-aligned boxes. Once we apply Boolean operations on the outputs of the individual select operation, the set of 
possible outputs, Sb — (T, £7*) , may form complex subsets, including unions of disjoint (half-open) axis aligned- 
rectangles and/or intersections of overlapping ones that cannot be represented as a collection of half spaces. Thus, we 
need to apply a different technique here. 

Proof. The output of a query q in can be seen as the Boolean combination (i.e. union and intersection) of the 
outputs of at most b "simple" select queries qi from where each of these queries qi is as in (|2]i. An AND operation 
in the predicate of q implies an intersection of the outputs of the corresponding two queries qi and qj, while an OR 
operation implies a union of the outputs. By applying Lemma|3] we obtain the result. □ 

4.2 Join Queries 

Let 7i and T2 be two distinct tables, and let i?i and R2 be two families of (outputs of) select queries on the tuples of 
Ti and T2 respectively. Let Si = (Ti, S'2 = (7^, i?2) and let VC{Si),VC{S2) > 2. Let C be a column along 
which 7i and T2 are joined, and let Tj = 71 x 72 be the Cartesian product of the two tables. 
For a pair of queries ri G r2 E R2, let 

J?lr2 = {(^i'^2) : h e ri,t2 G r2,h.Copt2.C}, 

where op G {>,<,>,<, =.7^}. 7°^ ^^ is the set of ordered pairs of tuples (one from Ti and one from 72 that forms 
the output of the join query 

SELECT * FROM 7i,72 WHERE ri AND r2. (3) 
Here we simplify the notation by identifying select queries with their predicates. We have 7°^^^ ^ ri x r2 and 

JOp J_^gj 

Jc = {J?lr, kl ei?l,r2,i?2, op G {>,<,>, <,-,/}}. 

Jc i the set of outputs of all join queries like the one in ([3]i, for all pairs of queries in Ri x 7^2 and all values of "op". 
We present here an upper bound to the VC-dimension of the range space Sj = {Tj, Jc). 

Lemma 6. VC{Sj) < 3{VC{Si) + VC{S2)) \og{{VC{Si) + VC{S2))). 

Proof. Let vi = VC{Si) and V2 — VC{S2). Assume that a set A C Tj is shattered by Jc, and \A\ — v. Consider 
the two cross-sections Ai = {x G 7i : {x,y) G A} and A2 ^ {y E T2 ■ {x,y) G A}. Note that \Ai\ < v and 
IA2I < D and by[I]|PRi(Ai)| < < v"^ and \Pii^{A2)\ < g{v2,v) < v"\ For each set r G Pjci^) (i.e., 

for each subset r C A, given that Pj^{A) = 2^^) there is a pair (ri, r2), ri G T^i, r2 G 7i!2, and there is op, such 
that r = 7°P^^. Each of such pair (ri,r2) identifies a distinct pair (ri r\ Ai,r2 0^2) G P]i^{Ai) x Pr^{A2), 
therefore each element of Pr^ (Ai) x Pr^ (^2) can be identified at most 6 times, the number of possible values for 
"op". 

In particular, for a fixed "op", an element of Prj(Ai) x ^^2(^2) can be identified at most once. To see this, 
consider two different sets si, S2 G Pj^{A). Let the pairs (oi, 02), (&i, 62), ai, bi G Ri, a2, 62 G 7^2, be such that 
si = A n 7°P^, and S2 = A n 7^°^,,^. Suppose that ai n Ai = bi n Ai (G Pr^ (Ai)) and a2 n A2 = 62 n A2 
(G Pr2(A2)). The set si can be seen as {(^1,^2) : G ai fl Ai, ^2 G 02 n A2 s.t. ti.C opt2-C}. Analogously the 
set S2 can be seen as {(ti, ^2) : ii G &i n Ai, ^2 £ &2 n A2 s.t. ti.C opt2.C}. But given that aiCiAi = &i n Ai and 
a2 D A2 = 62 n A2, this leads to si — S2, a contradiction. Hence, a pair (ci, C2), ci G Pr^ i^i), C2 G Pr^ {A2) can 
only be identified at most 6 times, one for each possible value of "op". 

Thus,|Pj^(A)| < 6\Pr,{Ai)\ ■ |Pij2(^2)|- Acouldnotbe shattered if |Pjp( A) | < 2", i.e., if 

\Pja{A)\ < 6\PrAAi)\ ■ \PrM2)\ < 6g{vi,v}g{v2,v) < 6v-^+-^ < 2\ 
The rightmost inequality holds for v > 3(wi + V2) log(t;i + W2). □ 
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The above results can be generalized to any query plan represented as a tree where the select operations are in the 
leaves and all internal nodes are join operations. As we said earlier, such a tree exists for any query. 

Lemma 7. Consider the class Q of queries that can be seen as combinations of select and joins on u > 2 tables 
7i, . . . , 7^. Let Si — (7i, Ri), i — 1, . . . ,u be the range space associated with the select queries on the u tables. 
Let Vi = VC{Si). Let m be the maximum number of columns in a table %■ We assume m < fiJl Let Sq = 
(7i X • • • X Tu, Rq) be the range space associated with the class Q. The range set Rq is defined as follows. Let 
p — (ri, . . . , r„), ri G Ri, and let lo be a sequence ofu — 1 join conditions representing a possible way to join the u 
tables Ti, using the operators {>, <, >, <, =, ^}. We define the range 

Jp = {(^1' ■■ ■,tu) ■ U & ri, s.t. {ti, .. .,tu) satisfies uj}. 
Rq is the set of all possible J^. Then, 

VC{Sq) < AuC^VC{S,))\og{uY,VC{S^)). 

i i 

Note that this Lemma is not just an extension of Lemma|6]to join queries between multicolumns tables. Instead, it 
is an extension to queries containing multiple joins (possibly between multicolumns tables). 

Proof. Assume that a set A C Tj is shattered by Rq, and |^| = v. Consider the cross-sections Ai — {x ^ 
Ti ■■ (yi,.. . ,yi-i,x,y.i+i, ...,?/„) e A}, 1 < i < m. Note that \ Ai\ <v andby[T]|PR,(^i)| < g{v^,v) < v"' . For 
each set r G Pj^, {A) (i.e., for each subset r (- A, given that Pj^, (A) — 2^) there is a sequence p — [ri, . . . , ru), 
ri ^ Ri, and there is an u, such that r = A D Jp. Each sequence p identifies a distinct sequence (ri D Ai,r2 n 
A2, ■ . ■ ,ru n Au) € Pfli(v4i) X • • • X Pji^{Au), therefore each element of Pr-^{Ai) x • • • x Pn,^X^u) can be 
identified at most 6"^^ times, one for each different oj. 

In particular, for a fixed lj, an element of Pji-^ (Ai) x ■ • • x Pj^^ (Au) can be identified at most once. To see this, 
consider two different sets si,S2 G Pj^ (A). Let the vectors pa = (ai, . . . , pt = (61, ... , b^), ai, bi G Ri, be 
such that si — An Jp^ and S2 — Ar\ Jp^. Suppose that n Ai — biC] Ai (G Pr. [Ai]). The set si can be seen as 
{(ti, . . . ,tu) : ti G fli n Ai, s.t. [ti, . . . , tu) satisfies lo}. Analogously the set S2 can be seen as {{ti, . . . ,tu) : <i G 
bi n Ai, s.t. ifi, . . . , tu) satisfies w}. But given that n Ai = 5^0 Ai, this leads to si = S2, a contradiction. Hence, 
a vector (ci, . . . , Cu), Ci G Pr- (Ai), can only be identified at most i times, one for each different lu. For each of the 
M — 1 join conditions composing oj we need to choose a pair {Ti-A, Ti-B) expressing the columns along which the 
tuples should be joined. There are at most g = ("2") such pairs (some of them cannot actually be chosen, e.g. those 
of the type {Ti.A,Ti.B)). There are then {JL^ ways of choosing these u — 1 pairs. For each choice ofu—1 pairs, 
there are 6'"~^^ ways of choosing the operators in the join conditions (6 choices for op for each pair). We have 

Thus, \P.jc{A)\ < 1\Pr^{Ai)\ |PflJ^„)|. ^ could not be shattered if I Pjc(yl) I < 2", i.e. if 

\Pjc{A)\ < I ■ \PrAAi)\ \Pb.AAu)\ < e ■ g{vi,v)g{v2,v) . ..g{vu,v) << {muf^ ■ v^'+-+^- < 2\ 

The rightmost inequality holds for v > Au Vi) log(u ''^i)- ^ 

4.3 General Queries 

Combining the above results we prove: 

Theorem 2. Consider a class Qu,m.b of all queries with up to u — 1 join and u select operations, where each select 
operation involves no more than m columns and b Boolean operations, then 

VC{Qu,^,b) < I2u^{m + l)61og((TO + 1)6) \og{iu^{m + l)61og((m + l)b)). 

Note that Theorem |2] gives an upper bound to the VC-dimension. Our experiments suggest that in most cases the 
VC-dimension and the corresponding minimum sample size are even smaller 

-The assumption m < Vi is reasonable for any practical case. 
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5 Estimating Query Selectivity 



We applying the theoretical result on the VC-dimension of queries to constructing a concrete algorithm for selectivity 
estimation and query plan optimization. 

5.1 The general scheme 

Our goal is to apply Def.[TO]and Thm.[T]to compute an estimate of the selectivity of SQL queries. Let Qu.m,b be a class 
of queries as in Theorem|2] The class Qu,m.b defines a range space S — {X, R) such that X is the Cartesian product 
of the tables involved in executing queries in Qu.mfi, and R is the family of all output sets of queries in Qu.m,b- Let S 
be an e-approximation of X and let r be the output set of a query q £ Qu.m,b when executed on the original dataset, 
then X Cir = r and r n 5 is the output set when the query is executed on the sample (see details below). Thus, by 
Def.[TOl 

\Xr]r\ \sr\r\ 



\x\ 



Wv{q) - cfs{q)\ < e, 



i.e., the selectivity of running a query q G Qu.m,b on an e-approximation of X is within e of the selectivity of q when 
executed on the original set of tables. Note that for any execution plan of a query q G Qu,m.b, all the queries that 
correspond to subtrees rooted at internal nodes of the plan are queries in Qu.m,b- Thus, by running query q on an 
e-approximation of X we obtain accurate estimates for the selectivity of all the subqueries defined by its execution 
plan. Corresponding results are obtained by using a relative {p, e) -approximation for X. 

5.2 Building and using the sample representation 

We apply Thm. [T]to probabilistically construct an e-approximation of X. A technical difficulty in algorithmic appli- 
cation of the theorem is that it is proven for a uniform sample of the Cartesian product of all the tables in the database, 
while in practice it is more efficient to maintain the table structure of the original database in the sample. It is easier 
to sample each table independently, and to run the query on a sample that consists of subsets of the original tables 
rather than re-writing the query to run on a Cartesian product of tuples. However, the Cart esian product of indepen dent 
uniform samples of tables is not a uniform sample of the Cartesian product of the tables I Chaudhuri et al. . 1999ll . We 



developed the following procedure to circumvent this problem. Assume that we need a uniform sample of size t from 
D, which is the Cartesian product of £ tables Ti, ■ ■ ■ We then sample t tuples uniformly at random from each table 
Ti, to form a sample table Si. We add an attribute sampleindex to each Si and we set the value in the added attribute 
for each tuple in Si to a unique value in [l,t]. Now, each sample table will contain t tuples, each tuple with a different 
index value in [1, t]. Given an index value i G [1, t], consider the set of tuples Xi ~ {xi, . . . , Xi}, xj G Si such that 
xi. sampleindex — X2-sampleindex = ■ ■ ■ — x£. sampleindex — i. Xi can be seen as atuple sampled from D, and 
the set of all Xi, i G [1, is a uniform random sample of size t from D. We run queries on the sample tables, but in 
order to estimate the selectivity of a join operation we count a tuple Y in the result only if the set of tuples composing 
y is a subset of Xi for some i G [1, i]. This is easily done by scanning the results and checking the values in the 
sampleindex columns (see Algorithms [T] and |2]i. 

Lemma 8. The CompnteSelect±v±ty procedure (inAlg.\2} executes a query on the Cartesian product of independent 
random samples of the tables but outputs the selectivity that corresponds to executing the query on a random sample 
of the Cartesian product of the original tables. 

Proof. The CreateSample procedure chooses from each table a random sample of t tuples and adds to each sampled 
tuple an index in [l,t]. Each sample table has exactly one tuple with each index value, and the Cartesian product of 
the sample tables has exactly one element that is a concatenation of tuples, all with the same index i in their tables. 
Restricting the selectivity computation to these t elements (as in Compute Selectivity) gives the result. □ 



Note that our method circumvent the major difficulty pointed out by Chaudhuri et al. 1 1999ll . They also proved 



that, in general, it is impossible to predict sample sizes for given two tables such that the join of the samples of two 
tables will result in a sample of a required size out of the join of the two tables. Our method does not require a sample 
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Algorithm 1: CreateSample(s, (Ti, . . . , Tk)) 
input : sample size s, tables 7i, Tk- 
output: sample tables Si, ...,Sk with t tuples each. 

1 for j ^ 1 to /c do 

2 I 5j ^ 

3 end 

4 for z <^ 1 to s do 

5 for j <— 1 to fc do 

6 f <— drawRandomTuple (7j ) 

7 s.sampleindexj <— i 

8 Sj ^ U {t} 

9 end 

10 end 



Algorithm 2: ComputeSelectivity(5, op) 


input : database operation op, sample database S = {Si , . . . 


, Sk) of size s. 


output: the selectivity of op. 




1 Oop executeOperation {S, op) ; 




2 {£i, . . . ,£j) indexes of the sample tables involved in op ; 




3 1^0; 




4 for tuple e Oop do 




5 1 if tuple. sampleindexi-^ — tuple. sampleindexir^ = • ■ • 


= tuple. sampleindexi- then i -s— i + 1; 


6 end 




7 selectivity i / s; 





of a given size from the result of a join. The VC-dimension sampling technique requires only a sample of a given size 
from the Cartesian product of the tables, which is guaranteed by the above procedure. 

Identifying the optimal query plan during query optimization may require executing several candidate query plans 
on the sample. A standard bottom-up candidate plan generation allows us to execute sub-plans once, store their results 
and reuse them multiple times as they will be common to many candidate plans. While the overhead of this execution- 
based selectivity estimation approach will still likely be higher than that of pre-computation based techniques (e.g., 
histograms), the reduced execution times of highly optimized plans enabled by better estimates, especially for complex 
and long-running queries, will more than compensate for this overhead. Thus, storing intermediate results that are 
common to several executions will speed up the total execution time on the sample. The significant improvement in 
the selectivity estimates in complex queries well compensates for the extra work in computing the selectivity estimates. 

6 Experiments 

This section presents the results of the experiments we run to validate our theoretical results and to compare our 
selectivity estimation method with standard techniques implemented in PostgreSQL and in Microsoft SQL Server. 

Goals. The first goal of the experiments is to evaluate the practical usefulness of our theoretical results. To assess 
this, we run queries on a large database and on random samples of it of different sizes. We use the selectivity of the 
each query in the random samples as an estimator for the selectivity in the large database with the adjustments for join 
operations, as described in the previous Section. We compute the error between the estimate and the actual selectivity 
to show that the thesis of Thm. [T]is indeed valid in practice. The use of a large number of queries and of a variety of 
parameters allows us to evaluate the error rate as a function of the sample size. We then compare our method with 
the commonly used selectivity estimation based on precomputed histograms (briefly described in Sect. 16. lb . We use 
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histograms with a different number of buckets to show that, no matter how fine-grained the histograms might be, as 
soon as the inter-column and intra-bucket assumptions are no longer satisfied, our approach gives better selectivity 
estimates. 

6.1 Selectivity Estimation with Histograms 

In many modern database systems, the query optimizer relies on histograms for computing data distribution statistics 
to help determine the most efficient query plans. In particular, PostgreSQL uses one-dimensional equi-depth (i.e., 
equal frequency buckets) histograms and a list of the most common values (MCV) for each column (of a database 
table) to compute optimizer statistics. The MCV information stores the most frequent N items (by default — 100) 
and their frequency for each column. The histograms (by default with 100 bins) are built for the values not stored in 
the MCV list. The selectivity of a constraint A = x, where A is a column and x is a value is computed from the MCV 
list if X is in the MCV list or from the histogram bin that contains a; if a; is not in the MCV list. The selectivity of 
a range constraint such as A < x is computed with information from both the MCV list and the histogram, i.e., the 
frequencies of the most common values less than x and the frequency estimate for A < x from the histogram will be 
added to obtain the selectivity. 

In PostgreSQL, the histograms and the MCV lists for the columns of a table are built using a random sample of the 
tuples of the table. The histograms and the MCV list for all columns of a table are based on the same sample tuples 
(and are therefore correlated). The sample size is computed for each column using a formula based on the table size, 
histogram size, and a target error probability developed by Chaudhuri et al. L1998il and the largest sample size required 
by the columns of a table is used to set the sample size of the table. 

Finally, the join selectivity of multiple constraints are computed using the attribute independence assumption: e.g., 
selectivities are added in case of an OR operator and multiplied for an AND operator Therefore, large selectivity 
estimation errors are possible for complex queries and correlated inputs. 

6.2 Setup 

Original tables. The tables in our large database were randomly generated and contain 20 million tuples each. 
There is a distinction between tables used for running selection queries and tables used for running join (and selection) 
queries. For tables on which we run selection queries only, the distributions of values in the columns fall in two 
different categories: 

• Uniform and Independent: The values in the columns are chosen uniformly and independently at random from 
a fixed domain (the integer interval [0, 200000], the same for all columns). Each column is treated independently 
from the others. 

• Correlated: Two columns of the tables contain values following a multivariate normal distribution with mean 
AI — /1I2.2 and a non-identity covariance matrix S (i.e., the values in the two different columns are correlated). 

The tables for join queries should be considered in pairs {A, B) (i.e., the join happens along a common column C of 
tables A and E). The values in the columns are chosen uniformly and independently at random from a fixed domain 
(the integer interval [0, 200000], the same for all columns). Each column is treated independently from the others. 

Sample tables. We sampled tuples from the large tables uniformly, independently, and with replacement, to build the 
sample tables. For the samples of the tables used to run join queries, we drew random tuples uniformly at random from 
the base tables independently and added a column sampleindex to each tuple such that each tuple drawn from the 
same base table has a different value in the additional column and with tuples from different tables forming an element 
of the sample (of the Cartesian product of the base tables) if they have the same value in this additional column, as 
described in Sect. 15.21 

For each table in the original database we create many sample tables of different sizes. The sizes are either fixed 
arbitrarily or computed using ([T]) from Thm. [T] The arbitrarily sized sample tables contain between 10000 and 1.5 
million tuples. To compute the VC-dimension-dependent sample size, we fixed e = 0.05, S = 0.05, and c = 0.5. 
The parameter d was set to the best bound to the VC-dimension of the range space of the queries we were running. 
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as obtained from our theoretical results. If we let m be the number of columns involved in the selection predicate of 
the queries and b be the number of Boolean clauses in the predicate, we have that d depends directly on m and b, as 
does the sample size s through ([TJ in Thm. [1] For selection queries, we used m = 1,2 and b — 1,2, 3, 5, 8, with the 
addition of the combination m = 5, 6 = 5. We run experiments on join queries only for some combinations of m and 
b (i.e. for m = 1 and b = 1,2,5,8) due to the large size of the resulting sample tables. Table[T]shows the sample sizes, 
as number of tuples, for the combinations of parameters we used in our experiments. 





Select 


Join 


m 


b 


VC-dim 


Sample size 


VC-dim 


Sample size 




1 


2 


1000 


4 


1400 




2 


4 


1400 


16 


3800 


1 


3 


6 


2800 


36 


7800 




5 


10 


2600 


100 


20600 




8 


16 


3800 


256 


51800 




2 


31 


6800 








3 


57 


12000 






2 


5 


117 


24000 








8 


220 


44600 






5 


5 


294 


59400 





Table 1: Sample Sizes 



Histograms. We built histograms with a different number of buckets, ranging from 100 to 10000. Due to limitations 
in PostgreSQL, incrementing the number of buckets in the histograms also increments the number of values stored in 
the MCV list. Even if this fact should have a positive influence on the quality of the selectivity estimates obtained from 
the histograms, our results show that the impact is minimal, especially when the inter-column independence and the 
intra-bucket uniformity assumptions are not satisfied. For SQL Server, we built the standard single-column histograms 
and computed the multi-column statistics which should help obtaining better estimations when the values along the 
columns are correlated. 

Queries. For each combination of the parameters m and b and each large table (or pair of large tables, in the case of 
join) we created 100 queries, with selection predicates involving m columns and b Boolean clauses. The parameters in 
each clause, the range quantifiers, and the Boolean operators connecting the different clauses were chosen uniformly 
at random to ensure a wide coverage of possible queries. 

6.3 Results 

Selection Queries. The first result of our experiments is that, for all the queries we run, on all the sample tables, the 
estimate of the selectivity computed using our method was within e (= 0.05) from the real selectivity. The same was 
not true for the selectivity computed by the histograms. As an example, in the case of to = 2, 6 = 5 and uniform 
independent values in the columns, the default PostgreSQL histograms predicted a selectivity more than e off from the 
real selectivity for 30 out of 100 queries. Nevertheless, in some of cases the histograms predicted a selectivity closer to 
the actual one than what our method predicted. This is especially true when the histogram independence assumption 
holds (e.g., for m — 2, b — 5 the default histograms gave a better prediction than our technique in 1 1 out of 100 cases). 
Similar situations also arise for SQLServer 

Since the selectivity estimated by the our method was always within e from the actual, we report the actual percent 
error, i.e. the quantity e% — ioo|p(^g^)~|^p('?)l where p{aq) is the predicted selectivity. We analyze the average and the 
standard deviation of this quantity on a set of queries and the evolution of these measures as the sample size increases. 
We can see from Fig.[T]and|2]that both the average and the standard deviation of the percentage error of the prediction 
obtained with our method decrease as the sample size grows (the rightmost plotted sample size is the one from Table[T] 
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Selectivity Prediction Error, Select, Uniform Independent Columns, m=2, b=5 
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Figure 1 : Select - Uniform Independent Columns - 



= 2, 6 = 5 



i.e., the one computed in ThmlT] More interesting is the comparison in those figures between the performance of the 
histograms and the performance of our techniques in predicting selectivities. When the assumptions of the histograms 
hold, as is the case for the data plotted in Fig.[T] the predictions obtained from the histograms are good. 
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Selectivity Prediction Error, Select, Correlated Columns, m=2, b=8 
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But as soon as the data are correlated (Fig.|2]l, our sampling method gives better predictions than the histograms 
even at the smallest sample sizes and keeps improving as the sample grows larger It is also interesting to observe how 
the standard deviation of the prediction error is much smaller for our method than for the histograms, suggesting a 
much higher consistency in the quality of the predictions. In Fig. |2]we do not show multiple curves for the different 
PostgreSQL histograms because increasing the number of buckets had very marginal impact on the quality of the 
estimates, sometimes even in the negative sense (i.e., an histogram with more buckets gave worse predictions than 
an histogram with less buckets), a fact that can be explained with the variance introduced by the sampling process 
used to create the histograms. For the same reason we do not plot multiple lines for the prediction obtained from the 
multi-columns and single-column statistics of SQL Server: even when the multi-column statistics were supposed to 
help, as in the case of correlated data, the obtained prediction were not much different from the ones obtained from 
the single-column histograms. 
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Join Queries. The strength of our method compared to histograms is even more evident when we run join queries, 
even when the histograms independent assumptions are satisfied. In our experiments, the predictions obtained using 
our technique were always within e from the real values, even at the smallest sample sizes, but the same was not 
true for histograms. For example, in the case of m = 1 and 6 = 5, 135 out of 300 predictions from the histograms 
were more than e off from the real selectivities. Figure [3] shows the comparison between the average and the standard 
deviation of the percentage error, defined in the previous paragraph, for the histograms and our method. The numbers 
include predictions for the selection operations at the leaves of the query tree. 
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Selectivity Prediction Error, Join, Uniform Independent Columns, m=1, b=1 
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Again, we did not plot multiple curves for histograms with a different number of buckets because the quality of the 
predictions did not improve as the histograms became more fine-grained. To understand the big discrepancy between 
the accurate predictions of our method and the wrong estimates computed by the histograms in PostgreSQL we note 
that for some join queries, the histograms predicted an output size on the order of the hundreds of thousands tuples but 
the actual output size was zero or a very small number of tuples. Observing the curves of the average and the standard 
deviation of the percentage error for the prediction obtained with our method, we can see that at the smaller sample 
sizes the quality of the predictions only improves minimally with the sample size. This is due to the fact that at small 
sizes our prediction for the join operation is very often zero or very close to zero, because the output of the query does 
not contain enough pairs of tuples from the sample of the Cartesian product of the input table (i.e. pairs of tuples with 
the same value in the sampleindex column). In these cases, the prediction can not be accurate at all (i.e. the error is 
100% if the original output contained some tuples, or 0% if the query returned an empty set in the large databases). 
As soon as the sample size grows more, we can see first a jump to higher values of the percentage error, which then 
behaves as expected, i.e., decreasing as the sample size increases. 

In Fig. [5] we also show a comparison between the percentage error of predictions obtained using our method in 
two different ways: the "theoretically correct" way that makes use of the number of pairs of tuples with the same 
value in the sampleindex column and the "practitioner" way which uses the size of the output of the join operation 
in the sample, therefore ignoring the sampleindex column. Recall that we had to add the sampleindex column 
because Thm. [T]requires a uniform sample of the Cartesian product of the input tables. As it is evident from Fig. [3] 
the "practitioner" way of predicting selectivity gives very good results at small sample sizes (although it does not 
offer theoretical guarant ees). T hese results are similar in spirit, although not equivalent, to the theoretical conclusions 
presented by Haas et al. Ill996ll in the setting of selectivity estimation using online sampling. 



7 Conclusions 



We develop a novel method for estimating the selectivity of queries by executing it on a concise, properly selected, 
sample of the database. We present a rigorous analysis of our method and extensive experimental results demonstrating 
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its efficiency and the accuracy of its predictions. 

Most commercial databases use histograms built on a single column, for selectivity estimation. There has also been 

significant research on improving the estimate using multidimensional histograms [ Bruno et al., 200 1 ; Poosala and lo annidisl 



19971: ISrivastava et all 120061: IWang and Sevcild l2003ll and jom synopses lAcharva et al.l.ll999ll . The main advantage 



of our method is that it gives uniformly accurate estimates for the selectivity of any query within a predefined VC- 
dimension range. Method that collect and store pre-computed statistics gives accurate estimates only for the relations 
captured by the collected statistics, while estimates of any other relation relies on an independence assumption. , 
which give probabilistic guarantees on the error of the predicted selectivity. 

To match the accuracy of our new method with histograms and join synopses one would need to create, for each 
table, a multidimensional histogram where the number of dimensions is equal to the number of columns in the tables. 
The space needed for a multidimensional histogram is exponential in the number of dimensions, while the size of our 
sample representation is almost linear in that parameter. Furthermore, to estimate the selectivity for join operations 
one would need to create join synopses for all pairs of columns in the database, again in space that grows exponential 
in the number of columns. 

It is interesting to note that the highly theoretical concept of VC-dimension leads in this work to an efficient and 
practical tool for an important data analysis problem. 
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